Channel semantic mutual learning for visible-thermal person re-identification

Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval task that aims to match the same pedestrian between visible and infrared cameras, so the modality discrepancy presents a significant challenge. Most methods employ different networks to extract features that are invariant between modalities. In contrast, we propose a novel channel semantic mutual learning network (CSMN), which attributes the difference in semantics between modalities to differences at the channel level. It optimises the semantic consistency between channels from two perspectives: the local inter-channel semantics and the global inter-modal semantics. Meanwhile, we design a channel-level auto-guided double metric loss (CADM) to learn modality-invariant features and the sample distribution in a fine-grained manner. We conducted experiments on RegDB and SYSU-MM01, and the experimental results validate the superiority of CSMN. In particular, on the RegDB dataset, CSMN improves the current best performance by 3.43% and 0.5% on the Rank-1 score and mINP value, respectively. The code is available at https://github.com/013zyj/CSMN.


Introduction
Person re-identification (ReID) [1] is a technology that employs computer vision algorithms to locate and retrieve a pre-defined individual from non-overlapping camera views. Previous studies [2][3][4][5][6][7] have mainly focused on ReID in visible light, capturing all images of a person with visible light cameras. However, visible light cameras may not capture a person's appearance at night. As a result, VI-ReID [8] was proposed.
Compared to single-modality ReID, VI-ReID faces not only intra-class variations, such as illumination and occlusion, but also a significant modality discrepancy, which makes it more challenging. Current methods for VI-ReID mainly follow three directions. First, modality-invariant features are extracted to address the cross-modality problem [9,10]. However, the quality of modality-invariant features is difficult to guarantee, which leads to a loss of information in pedestrian image representations. Second, GAN-based cross-modality transformation [11][12][13][14] converts cross-modality matching into within-modality matching to improve retrieval accuracy. However, such methods inevitably increase the computational complexity of the model and introduce noise, which degrades model performance. Third, some work has been devoted to improving metric learning methods [15][16][17][18]. However, these methods only learn the sample distribution at the instance level and do not handle outlier samples.
To reduce the discrepancy between the channel semantics within a modality and between modalities, we designed a novel Channel Semantic Mutual Learning Network (CSMN), which simultaneously learns channel semantic consistency from two aspects: Intra-Modality Channel Semantic Mutual Learning (ICSM), which focuses on learning fine-grained information by increasing the similarity of feature distributions between channels, and Cross-Modality Channel Semantic Mutual Learning (CCSM), which aims at learning global information by reducing the distance between feature distributions across modalities. In addition, we propose a Channel-level Auto-guided Double Metric loss (CADM) to optimise intra-class and inter-modality feature distributions in a more fine-grained manner. Specifically, our approach reduces intra-class instance discrepancies and aggregates semantic information for the same identity, while also narrowing the gap between modalities by strengthening the correlation between semantic information for the same identity across different modalities. Additionally, we designed an auto-guided function to mitigate the generation of noisy samples. Since infrared images cannot be viewed as normal RGB images, we use the gray-to-color method to convert infrared images to colored images. Fig 1 shows the overall structure of the model.
In summary, the main contributions of this paper are:
We propose a channel semantic mutual learning network (CSMN) for VI-ReID that treats modality discrepancy as inter-channel discrepancy and reduces intra-modality channel discrepancy while learning inter-modal channel information to bridge modality discrepancy.
We propose a channel-level auto-guided double metric loss (CADM) to optimise the intra- and inter-modality sample distribution from multiple aspects, including reducing intra-class instance differences, strengthening the correlation between samples of the same identity across different modalities, and handling outlier samples.
We have conducted extensive experiments on two benchmarks. Specifically, CSMN achieves state-of-the-art performance on SYSU-MM01 and improves the Rank-1 score of the current best performance on the RegDB dataset by 3.43%.

Single-modality ReID
Single-modality ReID attempts to retrieve a specific person from a library of images obtained from different cameras during the day, where the images obtained have the same modality. Person re-identification (ReID) methods have greatly improved as deep learning technology has advanced. Many methods have focused on building local-based models to fully explore fine-grained features in a person's images [19][20][21]. Fu et al. [10] learned local features at different scales using a pyramid structure and eventually obtained multi-scale fused features. Lian et al. [20] designed an attention-aligned network for feature learning that uses channel and spatial attention. Wang et al. [21] proposed a multi-branch network where one branch captures global representations and the other branch focuses on local information. In addition, attention models [22][23][24][25][26] are essential for designing novel network architectures that highlight salient regions and alleviate misalignment to learn robust features.

Visible-infrared person re-identification
VI-ReID aims to match and identify the same pedestrian not only across different cameras but also across different modalities. Wu et al. [8] published the SYSU-MM01 dataset and proposed a model to extract modality-shared person features. Dai et al. [27] proposed cmGAN to reduce the modality differences between visible and infrared images. Dual-stream networks have also been widely used to address modality discrepancy problems [16,28]. Ye et al. [16] proposed a model to address intra-class variation caused by viewpoint variation. Ye et al. [28] proposed a novel DDAG learning method for VI-ReID by mining modal contextual cues. However, the methods above focus on reducing modality differences at the instance level. In contrast, this paper aims to learn more discriminative clues at the channel level, enabling semantic consistency between channels.

Metric learning
Metric learning plays a crucial role in inter-sample similarity measures for Re-ID. Ye et al. [16] provided a loss to learn discriminative feature representations using a two-stream network [29]. They also introduced a major constraint to enhance performance [30]. To reduce intra- and cross-modal variation, Hao et al. [31] proposed a network with classification and recognition constraints. Zhao et al. [32] introduced the hard pentaplet loss to improve VI-ReID performance. Wu et al. [33] designed a novel focal modality-aware loss that guides inter-modal similarity learning with intra-modal similarity. However, the above methods only use the Euclidean metric, which cannot learn modality-shared discriminative features from multiple perspectives, and they ignore the impact of noisy samples on model performance.

Method
In this section, as shown in Fig 1, we introduce CSMN, which consists of three parts: Modality-specific, Modality-shared and Loss. The Modality-specific part contains two essential elements: 1) Intra-Modality Channel Semantic Mutual Learning (ICSM), which reduces differences between instances by learning semantic information among channels of instances within the same modality. 2) Cross-Modality Channel Semantic Mutual Learning (CCSM), which learns the relationships between channels of different modalities and aggregates semantic information of the same identity at the channel level. The features learned from the different branches then go through the Modality-shared part for further feature learning. In terms of metric learning, we design the Channel-level Auto-guided Double Metric loss (CADM), which optimizes the distribution of samples within and between modalities.

Intra-Channel Semantic Mutual Learning
RGB image channels contain different semantic information and have certain correlations. As depicted in Fig 2, modality-specific features are extracted by specific feature layers. As visible light and infrared images are captured based on different imaging principles, modality-specific features correspond to different semantic information for the same identity. Since infrared images are obtained based on the temperature distribution on the surface of objects, they cannot be treated as ordinary three-channel images. In this paper, we attribute the differences between modalities to differences between channels. Hence, the key to this problem is to ensure the identity correlation of channel features and reduce semantic changes between channels. Since the extended three-channel infrared image exhibits heterogeneity in the R/G/B channels, we aim to train the network to learn R/G/B channel distributions similar to those of visible images. To reach this goal, we designed an Intra-Modality Channel Semantic Mutual Learning (ICSM) module, as shown in Fig 2, in which the colours red, blue, and green indicate how similar the channel feature distributions are to each other. Our method focuses on maximizing the channel-level semantic consistency within each modality. We represent channel-level consistency as the logical distribution similarity between channel features. It can be formulated as follows:

where L_ICSM represents the semantic consistency between the three channels, and ICSM aims to minimize the semantic difference between channels. To achieve the above goals, the following formula is used to optimize the parameters θ_v and θ_t:
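The idea of channel-level consistency described above can be illustrated with a small sketch. It is not the authors' formulation: it assumes, purely for illustration, that each of the R, G and B channel features yields a vector of identity logits and that consistency is measured by a symmetric KL divergence between the resulting softmax distributions. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def channel_consistency_loss(channel_logits):
    """Mean symmetric KL divergence over all unordered channel pairs.

    Assumption: one logit vector per channel (R, G, B); a lower value
    means the channels agree more on the identity distribution.
    """
    probs = [softmax(logits) for logits in channel_logits]
    pairs = [(i, j) for i in range(len(probs)) for j in range(i + 1, len(probs))]
    total = sum(
        kl_divergence(probs[i], probs[j]) + kl_divergence(probs[j], probs[i])
        for i, j in pairs
    )
    return total / len(pairs)


# Hypothetical identity logits for the three channels of one image.
agree = [[2.0, 0.5, -1.0], [2.0, 0.5, -1.0], [2.0, 0.5, -1.0]]
disagree = [[2.0, 0.5, -1.0], [-1.0, 2.0, 0.5], [0.5, -1.0, 2.0]]

loss_agree = channel_consistency_loss(agree)
loss_disagree = channel_consistency_loss(disagree)
print(loss_agree, loss_disagree)
```

Minimizing such a term would pull the three channel distributions together, which is the behaviour the text attributes to ICSM; the actual divergence and weighting used in the paper may differ.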

Cross-Modality Channel Semantic Mutual Learning
In addition, to reduce the differences between modalities, we propose the Cross-Modality Channel Semantic Mutual Learning (CCSM) method, which aims to maximize the inter-modality channel semantic consistency. This method uses inter-modality semantic consistency to aggregate features from different modalities under the same identity. We further reduce the inter-modality channel semantic differences based on the modality-specific semantic consistency features. Since the modality-specific extractors θ_v and θ_t extract features within each modality, the extracted modality-specific features are independent. The following formula can represent the features of each modality:

where S_t and S_v represent the number of samples in the infrared and visible images, and the weights of the i-th feature vector in the different modalities are adjusted to reduce the influence of outliers. C_v and C_t are batch-computed. CCSM aims to learn semantic information between modalities rather than identity information. Metric learning reduces the distance between C_v and C_t, so that the features extracted by θ_v and θ_t carry more semantically consistent information. In CCSM, the goal is to maximize the inter-modality semantic consistency between visible and infrared image features:

Fig 3 shows the collaborative process of ICSM and CCSM. Combining the two not only reduces the differences between instances of the same identity within each modality but also improves the matching accuracy of the same identity between modalities.
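The role of the per-sample weights in the batch-level modality features can be sketched as follows. This is only an illustration under stated assumptions: the modality feature is taken to be a weighted mean of the sample features in a batch, the gap between modalities is measured with the Euclidean distance, and the weights are set by hand. How the paper actually computes the weights is not specified here, and all names and values are hypothetical.

```python
import math


def weighted_center(features, weights):
    """Weighted mean of feature vectors (one weight per sample)."""
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * f[k] for f, w in zip(features, weights)) / total
        for k in range(dim)
    ]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy batch: the last visible feature is an outlier.
visible = [[1.0, 0.0], [1.2, 0.2], [5.0, 5.0]]
infrared = [[0.9, 0.1], [1.1, 0.1]]

c_t = weighted_center(infrared, [1.0, 1.0])
c_v_uniform = weighted_center(visible, [1.0, 1.0, 1.0])
c_v_downweighted = weighted_center(visible, [1.0, 1.0, 0.1])  # hand-set weight

gap_uniform = euclidean(c_v_uniform, c_t)
gap_downweighted = euclidean(c_v_downweighted, c_t)
print(gap_uniform, gap_downweighted)
```

With uniform weights the outlier drags the visible center away from the infrared one; down-weighting it gives a center that better reflects the bulk of the batch, which is the effect the text describes.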

CADM
Most existing metric learning methods operate at the instance level, a coarse-grained approach that is also vulnerable to noisy samples. To learn fine-grained features and reduce the impact of noisy samples on the feature space, we propose the channel-level auto-guided double metric loss (CADM). We obtain semantically consistent features f_v and f_r from the modality-specific extractors θ_v and θ_t, respectively. Then we use a weight-shared feature extractor θ_s to obtain rich semantic features.

Different metrics select different hardest samples. For the Euclidean metric, p1 and a in Fig 3(A) form the positive pair, and n1 and a form the negative pair; for the cosine metric in Fig 3(B), however, p2 and a form the positive pair, and n2 and a form the negative pair. Therefore, to learn the sample distribution from multiple perspectives, we propose a double metric loss (DM), which adds to the Euclidean metric a cosine metric that takes the direction of the feature vector into account:

Additionally, to deal with noisy samples, we introduce an auto-guided function. Specifically, we utilize the Euclidean metric to calculate the similarity between samples and construct the corresponding similarity matrix. We first extract the positions of all positive and negative samples from the Euclidean distance matrix to generate a position mask. The position mask is then used to select the distances between samples in the distance matrix. All sample distances are then combined using the proposed auto-guided function. The following is the auto-guided function.
where d is the constant slope that controls the auto-guided function, and δ is a very small constant that ensures the function value is greater than zero.
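The observation that the Euclidean and cosine metrics pick different hardest samples, which motivates the double metric loss, can be checked on a toy example. The sketch below assumes a standard hardest-sample triplet hinge for each metric and combines the two as L_euc + b * L_cos, so that b = 0 reduces to the Euclidean-only case. The margins, the feature values and the function names are hypothetical, and the auto-guided function is not modelled.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hardest_triplet_loss(anchor, positives, negatives, dist, margin):
    """Hinge on the farthest positive and the closest negative."""
    hardest_pos = max(dist(anchor, p) for p in positives)
    hardest_neg = min(dist(anchor, n) for n in negatives)
    return max(0.0, hardest_pos - hardest_neg + margin)


def double_metric_loss(anchor, positives, negatives, b=1.0,
                       margin_euc=0.3, margin_cos=0.3):
    """Assumed form: Euclidean triplet term plus b times cosine triplet term."""
    l_euc = hardest_triplet_loss(anchor, positives, negatives, euclidean, margin_euc)
    l_cos = hardest_triplet_loss(anchor, positives, negatives, cosine_distance, margin_cos)
    return l_euc + b * l_cos


anchor = [1.0, 0.0]
positives = [[3.0, 0.0], [0.8, 0.6]]   # same identity as the anchor
negatives = [[1.5, 1.5], [0.0, 1.0]]   # different identities

hardest_pos_euc = max(positives, key=lambda p: euclidean(anchor, p))
hardest_pos_cos = max(positives, key=lambda p: cosine_distance(anchor, p))
hardest_neg_euc = min(negatives, key=lambda n: euclidean(anchor, n))
hardest_neg_cos = min(negatives, key=lambda n: cosine_distance(anchor, n))

loss_b0 = double_metric_loss(anchor, positives, negatives, b=0.0)
loss_b1 = double_metric_loss(anchor, positives, negatives, b=1.0)
print(hardest_pos_euc, hardest_pos_cos, loss_b0, loss_b1)
```

In this toy batch the Euclidean metric selects the positive that is far away but points in the same direction, while the cosine metric selects the positive that is close but points elsewhere, so the two terms supervise different pairs.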
The CADM can finally be expressed as:

where c is the coefficient of the auto-guided function loss L_p. Therefore, the final expression of the loss function is as follows:

Experiment

Experimental settings
Dataset and setting.
We evaluate the performance of our proposed approach on the VI-ReID task through experiments on two widely used benchmark datasets, SYSU-MM01 [8] and RegDB [34]. The SYSU-MM01 dataset, the largest VI-ReID dataset, is captured by four visible and two near-infrared cameras. The training set consists of 22,258 visible images and 11,909 thermal images of 395 individuals. The test set has 96 distinct identities, with 3,803 thermal images used as queries and 301 visible images used as the gallery. We used the single-shot all-search and indoor-search modes in our experiments. The dataset's configuration details can be found in [35]. The RegDB dataset consists of images captured by one visible camera and one far-infrared camera. It contains 412 identities, each represented by 20 images (10 visible and 10 infrared). According to the current VI-ReID settings [36], 206 identities are chosen randomly for training, and the remaining 206 identities are allocated to the test set.

Evaluation metrics.
To assess the performance of our method, we use cumulative matching characteristics (CMC), mean average precision (mAP), and mean inverse negative penalty (mINP) [36]. CMC measures the probability that the correct image of the person appears among the top-ranked retrieval results. mAP evaluates the retrieval system's performance when a gallery set contains multiple matched images. Furthermore, mINP considers the most difficult correct match, reflecting the amount of work left for inspectors.
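For a single query, the three metrics can be computed from the ranked gallery as sketched below. The sketch assumes the gallery is already sorted by ascending distance to the query and that each entry is flagged as a correct match or not; the function name and the toy ranking are hypothetical. CMC, mAP and mINP are then obtained by averaging the per-query values over all queries.

```python
def rank_metrics(ranked_matches):
    """Per-query CMC curve, average precision (AP) and inverse negative penalty (INP).

    ranked_matches: list of bools, gallery sorted from most to least similar;
    True marks an image of the query identity.
    """
    positions = [i + 1 for i, hit in enumerate(ranked_matches) if hit]
    if not positions:
        raise ValueError("query has no correct match in the gallery")

    first_hit = positions[0]
    hardest_hit = positions[-1]

    # CMC: 1 at rank k if a correct match appears within the top k results.
    cmc = [1 if k >= first_hit else 0 for k in range(1, len(ranked_matches) + 1)]

    # AP: mean of the precision values measured at each correct match.
    ap = sum((n + 1) / pos for n, pos in enumerate(positions)) / len(positions)

    # INP: number of correct matches divided by the rank of the hardest one.
    inp = len(positions) / hardest_hit
    return cmc, ap, inp


# Toy ranking: correct matches at ranks 2 and 4 of a five-image gallery.
cmc, ap, inp = rank_metrics([False, True, False, True, False])
print(cmc, ap, inp)
```

Here Rank-1 is 0 because the top result is wrong, AP is (1/2 + 2/4) / 2 = 0.5, and INP is 2/4 = 0.5 because an inspector has to go through four results to find both correct images.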

Implementation details.
We use CAJL [37] as the baseline network. ImageNet pre-trained weights are used to initialize the network parameters. We employ PK sampling with P = 8 and K = 4. We use zero-padded, randomly cropped images (288 × 144) as training data to augment the original dataset. The network is optimized with SGD. The learning rate starts at 0.1 and is reduced after 20 and 50 epochs. There are 100 training epochs in total. All tests were performed on an Nvidia 3090 GPU with PyTorch 1.6 and CUDA 11.0.
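The sampling and learning-rate settings above can be sketched in plain Python. This is an illustration rather than the training code: it samples one modality only (VI-ReID pipelines commonly draw K images per identity from each modality), it assumes that the milestones 20 and 50 are epochs, and it assumes a decay factor of 0.1, which the text does not state. The function names and the toy dataset are hypothetical.

```python
import random


def pk_batch(ids_to_images, p=8, k=4, rng=random):
    """Draw P identities, then K images per identity (with repeats if needed)."""
    identities = rng.sample(sorted(ids_to_images), p)
    batch = []
    for pid in identities:
        images = ids_to_images[pid]
        if len(images) >= k:
            chosen = rng.sample(images, k)
        else:
            chosen = [rng.choice(images) for _ in range(k)]
        batch.extend((pid, img) for img in chosen)
    return batch


def step_lr(epoch, base_lr=0.1, milestones=(20, 50), gamma=0.1):
    """Step schedule: multiply by gamma at each milestone (gamma is assumed)."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)


# Toy dataset: 20 identities with 10 image names each.
toy = {pid: [f"id{pid}_img{j}" for j in range(10)] for pid in range(20)}
batch = pk_batch(toy, p=8, k=4, rng=random.Random(0))

print(len(batch), len({pid for pid, _ in batch}))
print([step_lr(e) for e in (0, 19, 20, 49, 50, 99)])
```

With P = 8 and K = 4 every batch holds 32 images from 8 identities, which guarantees that each anchor has positives and negatives available for triplet-style losses.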

Ablation study
To verify the effectiveness of ICSM, CCSM, DM, and CADM, we conducted detailed experiments on the RegDB and SYSU-MM01 datasets.

Effectiveness of Intra-Channel Semantic Mutual Learning (ICSM).
As shown in Table 1, taking the visible-to-infrared mode as an example, adding only L_ICSM to the Base model achieved a Rank-1 score of 86.5% and an mAP score of 77.25%. This improved the Rank-1 score of the baseline model by 1.47%. The experimental results show that ICSM can learn fine-grained features among channels within a modality, reducing the differences between channels within the same modality.

Effectiveness of Cross-Modality Channel Semantic Mutual Learning(CCSM).
Unlike ICSM, CCSM learns global information between modalities to aggregate samples of the same identity across modalities. As shown in Table 1, L_CCSM represents the loss between modalities. Taking the visible-to-infrared mode as an example, using only L_CCSM on the Base model increased the Rank-1 score and mAP score by 2.98% and 0.85%, respectively. Additionally, we observed that using both L_CCSM and L_ICSM on the Base model further improved the model's performance. Specifically, compared to Base + L_ICSM, the Rank-1 and mAP scores improved by 1.61% and 2.81%, respectively. Compared to Base + L_CCSM, the Rank-1 score improved by 0.1% and the mAP score by 0.07%. These results show that CCSM can effectively reduce the differences between modalities and that combining ICSM and CCSM further enhances the model's performance.
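The Rank-1 figures quoted in the two ablation paragraphs can be cross-checked against each other. The short calculation below uses only numbers stated in the text and assumes that all gains are absolute percentage points measured in the same visible-to-infrared setting; the variable names are hypothetical and Table 1 itself remains the authoritative source.

```python
# Rank-1 values and gains quoted in the text (percentage points, assumed absolute).
rank1_base_icsm = 86.50          # Base + L_ICSM
gain_icsm_over_base = 1.47       # Base + L_ICSM vs. Base
gain_ccsm_over_base = 2.98       # Base + L_CCSM vs. Base
gain_both_over_icsm = 1.61       # Base + L_ICSM + L_CCSM vs. Base + L_ICSM
gain_both_over_ccsm = 0.10       # Base + L_ICSM + L_CCSM vs. Base + L_CCSM

# Path 1: recover the Base score from the ICSM-only row.
rank1_base_via_icsm = round(rank1_base_icsm - gain_icsm_over_base, 2)

# Path 2: go up to the combined model, down to CCSM-only, then down to Base.
rank1_both = round(rank1_base_icsm + gain_both_over_icsm, 2)
rank1_base_ccsm = round(rank1_both - gain_both_over_ccsm, 2)
rank1_base_via_ccsm = round(rank1_base_ccsm - gain_ccsm_over_base, 2)

print(rank1_base_via_icsm, rank1_base_ccsm, rank1_both, rank1_base_via_ccsm)
```

Both paths give the same Base Rank-1 score, so the reported Rank-1 gains are mutually consistent under the stated assumption.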

Effectiveness of Double Metric Loss (DM).
As shown in Table 2, we conducted a series of experiments on methods based on the Euclidean and cosine metrics to demonstrate that the double metric, which combines the two, can effectively improve the baseline performance. It should be noted that these experiments were conducted on top of Base + L_ICSM + L_CCSM. Baseline1 only uses the Euclidean metric, whereas Baseline2 only uses the cosine metric. The DM-based model outperformed Baseline1 on the RegDB dataset by 0.53% and 0.39% on Rank-1 and mAP, respectively, and by 0.41% and 0.05% on the SYSU-MM01 dataset. The experimental results show that DM can learn the feature distribution from multiple perspectives in a fine-grained manner, thereby improving the performance of the model.

Effectiveness of Channel-level Auto-guided Double Metric Loss (CADM).
We compared our proposed method with commonly used loss functions, as shown in Table 3, and found that CADM outperformed the other loss functions. Specifically, on the RegDB dataset, it exceeds CELoss by 5.09% and TripletLoss by 4.19% at Rank-1. In addition, its Rank-1 score exceeds that of DM by 1.02%. On the SYSU-MM01 dataset, the Rank-1 score of the proposed method is 0.92% higher than that of DM. These results show that CADM can effectively handle abnormal samples. As shown in Table 4, on the RegDB and SYSU-MM01 datasets, we tested the effect of different weights c on the loss of the auto-guided function. When c is small, CADM performs poorly, even worse than DM, because the weight coefficient is so small that the model parameters do not converge sufficiently. CADM's performance is optimal when the weight coefficient c is set to 1. In this case, CADM outperforms DM by 1.02%/2.74% on Rank-1/mAP on RegDB and by 0.92%/0.74% on SYSU-MM01. When c exceeds 1, CADM's performance decreases rather than increases. One possible explanation is that a very large c merely amplifies the gradient when the model parameters have already been optimized as far as they can be.

Visualization analysis
To demonstrate the effectiveness of our proposed method more intuitively, we use heat maps to display the features learned from pedestrian images. The heat maps of pedestrian images in the two modalities are shown in Fig 5(A) and 5(B), respectively. As shown in the figure, the heat maps obtained from CSMN (bottom) focus more on identity-related information than the heat maps obtained from the baseline network (CAJL [37], top). This suggests that CSMN is not particularly sensitive to distracting information (illumination, occlusion, etc.). As a result, the model has a high degree of generalizability.

Comparison to the state-of-the-art methods
We compared CSMN with the existing state-of-the-art VI-ReID methods on two benchmark datasets. Tables 5 and 6 show the detailed results for different evaluation metrics. As shown in Table 5, on the RegDB dataset, SCFNet [45] also designs loss functions to reduce the impact of outlier samples on the spatial features and achieves excellent retrieval accuracy. However, our method achieves better results. Specifically, the proposed method outperforms SCFNet by 3.43% in Rank-1 and 1.66% in mAP. For the SYSU-MM01 dataset, as shown in Table 6, CSMN achieves state-of-the-art performance. CSMN outperforms CAJL [37] by 1.33% in Rank-1 and 0.51% in mINP in the more complicated all-search mode. HCTri [39], which also improves the loss function, proposes a hetero-center triplet loss to alleviate the strict constraints of the traditional triplet loss. Our proposed CSMN nevertheless outperforms HCTri on Rank-1 and mAP by 9.53% and 10.17%, respectively. These findings imply that CSMN can effectively reduce the differences between modalities and between channels within a modality, and that CADM can learn the sample distribution in a more fine-grained manner and deal with outlier samples.

Conclusion
This paper proposes a CSMN framework for visible-thermal person re-identification, which considers cross-modality differences as differences between channels. We reduce the differences between channels in two ways. On the one hand, we propose ICSM, which learns fine-grained features among channels within a modality to maximize the consistency between channels and minimize the differences between them. On the other hand, we propose CCSM, which learns global channel features between modalities to aggregate samples of the same identity across modalities. In addition, to better optimize the sample distribution between and within modalities, we propose CADM. Unlike methods that learn the sample distribution at the instance level, our method fully exploits the advantages of channel consistency to learn the sample distribution in a more fine-grained manner. Moreover, we use an auto-guided function to reduce the generation of outlier samples. Our experiments on two benchmark datasets indicate that CSMN outperforms the existing state-of-the-art methods for VI-ReID.

Fig 3. The diagram depicts the single and double metric learning methods, where C(p) and C(n) represent the cosine values between the positive and negative samples. https://doi.org/10.1371/journal.pone.0293498.g003

Furthermore, we examined the effect of different values of b on DM on the RegDB dataset. As shown in Fig 4, b = 0 is equivalent to using only the Euclidean metric, and the performance of DM gradually improves as b increases. When b = 1, DM performs optimally; however, as b increases further, DM's performance decreases. This indicates that the two metrics have an equal impact on model performance, demonstrating their complementarity.

Fig 4. The figure shows the effect of variation in b on the DM performance on RegDB. Rank-1 is represented by the long blue bar, mAP by the long orange bar, and mINP by the long gray bar. The red "-" line represents the Rank-1 baseline, the green "-" line represents the mAP baseline, and the black "-" line represents the mINP baseline. https://doi.org/10.1371/journal.pone.0293498.g004

Fig 5. Heat maps extracted by the baseline network (CAJL) and CSMN are displayed on top and bottom, respectively. Note that the pedestrian images are similar but not identical to the original images and are therefore for illustrative purposes only. (a) Comparison of heat maps extracted by the CSMN and the baseline network (CAJL) in infrared modality. (b) Comparison of heat maps extracted by the CSMN and the baseline network (CAJL) in visible modality. https://doi.org/10.1371/journal.pone.0293498.g005